47 research outputs found

    Design of Clustered Superscalar Microarchitectures

    Get PDF
    L'objectiu d'aquesta tesi és proposar noves tècniques per al disseny de microarquitectures clúster superescalars eficients. Les microarquitectures clúster particionen el disseny de diversos components crítics del hardware com a mitjà per mantenir-ne el paral·lelisme i millorar-ne l'escalabilitat. El nucli d'un processador clúster, format per blocs de baixa complexitat o clústers, pot executar cadenes d'instruccions dependents sense pagar el sobrecost d'una llarga emissió, curtcircuïts, o lectura de registres, encara que si dues instruccions dependents s'executen en clústers diferents, es paga la penalització d'una comunicació. Per altra banda, les estructures distribuïdes impliquen generalment menors requisits de potència dinàmica, i simplifiquen la gestió de l'energia per mitjà de tècniques com la desactivació selectiva del rellotge o l'energia, o com la reducció a escalade la tensió.El primer objecte d'aquesta recerca és l'assignació d'instruccions a clústers, ja que aquesta juga un paper clau en el rendiment, amb l'objectiu de mantenir equilibrada la càrrega i reduir la penalització de les comunicacions crítiques. Es proposen dos diferents enfocs: primer, una família de nous esquemes que identifiquen dinàmicament certs grups d'instruccions dependents anomenats "slices", i fan l'assignació de clústers slice per slice. Es diferencien d'altres enfocs previs, ja sigui perquè són dinàmics i/o bé perquè inclouen nous mecanismes explícits de mesura i gestió de l'equilibri de càrrega. Segon, una família de nous esquemes que assignen clústers instrucció per instrucció, basats en les assignacions prèvies dels productors dels registres fonts, en la ubicació dels registres físics, i en la càrrega de treball.La segona contribució proposa la predicció de valors com a mitjà per mitigar les penalitzacions dels retards dels connectors i, en particular, per amagar les comunicacions entre clústers. Es demostra que el benefici obtingut amb l'eliminació de dependències creix amb el nombre de clústers i la latència de les comunicacions i, doncs, és major que per a una arquitectura centralitzada. Es proposa un nou esquema d'assignació de clústers que aprofita la menor densitat del graf de dependències per tal de millorar l'equilibri de càrrega.El tercer aspecte considerat es la xarxa d'interconnexió entre clústers, ja que determina la latència de les comunicacions, amb l'objectiu de trobar el millor compromís entre cost i rendiment. Es proposen diverses xarxes punt-a-punt, tant síncrones com parcialment asíncrones, que assoleixen un IPC pròxim al d'un model ideal amb ample de banda il·limitat, tot i tenir molt baixa complexitat. Llur impacte sobre els curtcircuïts, cues d'emissió o bancs de registres es molt menor que el d'altres enfocs. Es proposen també possibles implementacions dels enrutadors, que il·lustren llur factibilitat amb solucions hardware molt simples i de baixa latència. Es proposa un nou esquema d'assignació de clústers conscient de la topologia, que redueix la latència de les comunicacions.L'última contribució proposa tècniques per distribuir els components principals de les etapes inicials del processador, amb l'objectiu de reduir-ne la complexitat i evitar-ne la replicació. Es proposen tècniques eficaces per a la partició del predictor de salts i la lògica de distribució d'instruccions, a fi de minimitzar la penalització pels retards dels connectors causada per les dependències recursives en dos llaços crítics del hardware: la generació de l'adreça de búsqueda d'instruccions i la lògica d'assignació d'instruccions, respectivament. En el primer cas, es converteixen els retards dels connectors intra-estructurals d'un predictor centralitzat en retards de comunicació entre clústers, els quals se segmenten sense problemes. En el segon cas, el particionat de la lògica d'assignació d'instruccions basada en dependències implica paral·lelitzar aquesta tasca, la qual es inherentment seqüencial.El objetivo de esta tesis es proponer técnicas para el diseño de microarquitecturas clúster superescalares eficientes. Las microarquitecturas clúster particionan el diseño de diversos componentes críticos del hardware como medio para mantener el paralelismo y mejorar la escalabilidad. El núcleo de un procesador clúster, formado por bloques de baja complejidad o clústers, puede ejectutar cadenas de instrucciones dependientes sin pagar el sobrecoste de una larga emisión, cortocircuitos, o lectura de registros; pero si dos instrucciones dependientes se ejecutan en clústers distintos, se paga la penalización de una comunicación. Por otro lado, las estructuras distribuidas implican generalmente menores requisitos de potencia dinámica, y simplifican la gestión de la energía por medio de técnicas como la desactivación selectiva del reloj o de la alimentación, o la reducción a escala del voltaje.El primer objetivo de esta investigación es la asignación dinámica de instrucciones a clústers, ya que ésta juega un papel clave en el rendimiento, a fin de mantener equilibrada la carga y reducir la penalización de las comunicaciones críticas. Se proponen dos enfoques distintos: primero, una familia de nuevos esquemas que identifican dinámicamente ciertos grupos de instrucciones denominados "slices", y realizan la asignación slice por slice. Éstos se diferencian de otros enfoques previos, ya sea porque son dinámicos y/o porque incluyen nuevos mecanismos explícitos de medida y gestión del equilibrio de carga. Segundo, una familia de nuevos esquemas que asignan clústers instrucción a instrucción, basándose en las asignaciones previas de los productores de sus registros fuente, en la ubicación de los registros físicos, y en la carga de trabajo.La segunda contribución propone la predicción de valores como medio para mitigar las penalizaciones de los retardos de los conectores, y en particular, para esconder las comunicaciones entre clústers. Se demuestra que el beneficio obtenido con la eliminación de dependencias aumenta con el número de clústers y con la latencia de las comunicaciones, y es asimismo mayor que para una arquitectura centralizada. Se propone un nuevo esquema de asignación de clústers que aprovecha la menor densidad del grafo de dependencias con el fin de mejorar el equilibrio de la carga.El tercer aspecto considerado es la red de interconexión entre clústers, pues determina la latencia de las comunicaciones, a fin de hallar el mejor compromiso entre coste y rendimiento. Se proponen diversas redes punto a punto, tanto síncronas como parcialmente asíncronas, que aun teniendo muy baja complejidad consiguen un IPC próximo al de un modelo con ancho de banda ilimitado. Su impacto sobre la complejidad de los cortocircuitos, colas de emisión o bancos de registros es mucho menor que el de otros enfoques. Se proponen también posibles implementaciones de los enrutadores, ilustrando su factibilidad como soluciones simples y de baja latencia. Se propone un esquema de asignación de clústers consciente de la topología, que reduce la latencia de las comunicaciones.La última contribución propone técnicas para distribuir los componentes principales de las etapas iniciales del procesador, con el objetivo de reducir su complejidad y evitar su replicación. Se proponen técnicas eficaces para particionar el predictor de saltos y la lógica de distribución de instrucciones, a fin de minimizar la penalización por retardos de conectores causada por las dependencias recursivas en dos bucles críticos del hardware: la generación de la dirección de búsqueda de instrucciones y la lógica de asignación de clústers. En el primer caso, los retardos de los conectores intra-estructurales de un predictor centralizado se convierten en retardos de comunicación entre clústers, que se pueden segmentar fácilmente. En el segundo caso, el particionado de la lógica de asignación de clústers basada en dependencias implica paralelizar esta tarea, intrínsecamente secuencial.The objective of this thesis is to propose new techniques to design efficient clustered superscalar microarchitectures. Clustered microarchitectures partition the layout of several critical hardware components as a means to keep most of the parallelism while improving the scalability. A clustered processor core, made up of several low complex blocks or clusters, can efficiently execute chains of dependent instructions without paying the overheads of a long issue, register read or bypass latencies. Of course, when two dependent instructions execute in different clusters, an inter-cluster communication penalty is incurred. Moreover, distributed structures usually imply lower dynamic power requirements, and simplify power management via techniques such a selective clock/power gating and voltage scaling.The first target of this research is the assignment of instructions to clusters, since it plays a major role on performance, with the goals of keeping the workload of clusters balanced and reducing the penalty of critical communications. Two different approaches are proposed: first, a family of new schemes that dynamically identify groups of data-dependent instructions called slices, and make cluster assignments on a per-slice basis. The proposed schemes differ from previous approaches either because they are dynamic and/or because they include new mechanisms to deal explicitly with workload balance information gathered at runtime. Second, it proposes a family of new dynamic schemes that assign instructions to clusters in a per-instruction basis, based on prior assignment of the source register producers, on the cluster location of the source physical registers, and on the workload of clusters.The second contribution proposes value prediction as a means to mitigate the penalties of wire delays and, in particular, to hide inter-cluster communications while also improving workload balance. First, it is proven that the benefit of breaking dependences with value prediction grows with the number of clusters and the communication latency, thus it is higher than for a centralized architecture. Second, it is proposed a cluster assignment scheme that exploits the less dense data dependence graph that results from predicting values to achieve a better workload balance.The third aspect considered is the cluster interconnect, which mainly determines communication latency, seeking for the best trade-off between cost and performance. First, several cost-effective point-to-point interconnects are proposed, both synchronous and partially asynchronous, that approach the IPC of an ideal model with unlimited bandwidth while keeping the complexity low. The proposed interconnects have much lower impact than other approaches on the complexity of bypasses, issue queues and register files. Second, possible router implementations are proposed, which illustrate their feasibility with very simple and low-latency hardware solutions. Third, a new topology-aware improvement to the cluster assignment scheme is proposed to reduce the distance (and latency) of inter-cluster communications.The last contribution proposes techniques for distributing the main components of the processor front-end with the goals of reducing their complexity and avoiding replication. In particular, effective techniques are proposed to cluster the branch predictor and the steering logic, that minimize the wire delay penalties caused by broadcasting recursive dependences in two critical hardware loops: the fetch address generation, and the cluster assignment logic, respectively. In the former case, the proposed technique converts the cross-structure wire delays of a centralized predictor into cross-cluster communication delays, which are smoothly pipelined. In the latter case, the partitioning of the instruction steering logic involves the parallelization of an inherently sequential task such as the dependence based cluster assignment of instructions

    The synergy of multithreading and access/execute decoupling

    Get PDF
    This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how both techniques complement each other: while decoupling features an excellent memory latency hiding efficiency, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, to support future growths in issue-width and clock speed. The simulations show that by adding decoupling to a multithreaded architecture, its miss latency tolerance is sharply increased and in addition, it needs fewer threads to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck.Peer ReviewedPostprint (published version

    Improving latency tolerance of multithreading through decoupling

    Get PDF
    The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.Peer ReviewedPostprint (published version

    An energy-efficient memory unit for clustered microarchitectures

    Get PDF
    Whereas clustered microarchitectures themselves have been extensively studied, the memory units for these clustered microarchitectures have received relatively little attention. This article discusses some of the inherent challenges of clustered memory units and shows how these can be overcome. Clustered memory pipelines work well with the late allocation of load/store queue entries and physically unordered queues. Yet this approach has characteristic problems such as queue overflows and allocation patterns that lead to deadlocks. We propose techniques to solve each of these problems and show that a distributed memory unit can offer significant energy savings and speedups over a centralized unit. For instance, compared to a centralized cache with a load/store queue of 64/24 entries, our four-cluster distributed memory unit with load/store queues of 16/8 entries each consumes 31 percent less energy and performs 4,7 percent better on SPECint and consumes 36 percent less energy and performs 7 percent better for SPECfp.Peer ReviewedPostprint (author's final draft

    Early register release for out-of-order processors with register windows

    Get PDF
    Register windows is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper we propose two early register release techniques that leverages register windows to drastically reduce the register requirements, and hence reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the same number as logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Leveraging register windows to reduce physical registers to the bare minimum

    Get PDF
    Register window is an architectural technique that reduces memory operations required to save and restore registers across procedure calls. Its effectiveness depends on the size of the register file. Such register requirements are normally increased for out-of-order execution because it requires registers for the in-flight instructions, in addition to the architectural ones. However, a large register file has an important cost in terms of area and power and may even affect the cycle time. In this paper, we propose a software/hardware early register release technique that leverage register windows to drastically reduce the register requirements, and hence, reduce the register file cost. Contrary to the common belief that out-of-order processors with register windows would need a large physical register file, this paper shows that the physical register file size may be reduced to the bare minimum by using this novel microarchitecture. Moreover, our proposal has much lower hardware complexity than previous approaches, and requires minimal changes to a conventional register window scheme. Performance studies show that the proposed technique can reduce the number of physical registers to the number of logical registers plus one (minimum number to guarantee forward progress) and still achieve almost the same performance as an unbounded register file.Peer ReviewedPostprint (published version

    Memory bank predictors

    Get PDF
    Cache memories are commonly implemented through multiple memory banks to improve bandwidth and latency. The early knowledge of the data cache bank that an instruction will access can help to improve the performance in several ways. One scenario that is likely to become increasingly important is clustered microprocessors with a distributed cache. This work presents a study of different cache bank predictors. We show that effective bank predictors can be implemented with relatively low cost. For instance, a predictor of approximately 4 Kbytes is shown to achieve an average hit rate of 78% for SPECint2000 when used to predict accesses to an 8-bank cache memory in a contemporary superscalar processor. We also show how a predictor can be used to reduce the communication latency caused by memory accesses in a clustered microarchitecture with a distributed cache design.Peer ReviewedPostprint (published version

    A cost-effective clustered architecture

    Get PDF
    In current superscalar processors, all floating-point resources are idle during the execution of integer programs. As previous works show, this problem can be alleviated if the floating-point cluster is extended to execute simple integer instructions. With minor hardware modifications to a conventional superscalar processor, the issue width can potentially be doubled without increasing the hardware complexity. In fact, the result is a clustered architecture with two heterogeneous clusters. We propose to extend this architecture with a dynamic steering logic that sends the instructions to either cluster. The performance of clustered architectures depends on the inter-cluster communication overhead and the workload balance. We present a scheme that uses run-time information to optimise the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int+4 fp) machine and that it outperforms the previously proposed one.Peer ReviewedPostprint (published version

    TEAPOT: a toolset for evaluating performance, power and image quality on mobile graphics systems

    Get PDF
    In this paper we present TEAPOT, a full system GPU simulator, whose goal is to allow the evaluation of the GPUs that reside in mobile phones and tablets. To this extent, it has a cycle accurate GPU model for evaluating performance, power models for the GPU, the memory subsystem and for OLED screens, and image quality metrics. Unlike prior GPU simulators, TEAPOT supports the OpenGL ES 1.1/2.0 API, so that it can simulate all commercial graphical applications available for Android systems. To illustrate potential uses of this simulating infrastructure, we perform two case studies. We first turn our attention to evaluating the impact of the OS when simulating graphical applications. We show that the overall GPU power/performance is greatly aff ected by common OS tasks, such as image composition, and argue that application level simulation is not sufficient to understand the overall GPU behavior. We then utilize the capabilities of TEAPOT to perform studies that trade image quality for energy. We demonstrate that by allowing for small distortions in the overall image quality, a signifi cant amount of energy can be saved.Postprint (author’s final draft

    TCOR: a tile cache with optimal replacement

    Get PDF
    Cache Replacement Policies are known to have an important impact on hit rates. The OPT replacement policy [27] has been formally proven as optimal for minimizing misses. Due to its need to look far ahead for future memory accesses, it is often reduced to a yardstick for measuring the efficacy of other practical caches. In this paper, we bring the OPT to life, in architectures for mobile GPUs, for which energy efficiency is of great consequence. We also mold other factors in the memory hierarchy to enhance its impact. The end results are a 13.8% decrease in the memory hierarchy energy consumption and an increased throughput in the Tiling Engine. We also observe a 5.5% decrease in the total GPU energy and a 3.7% increase in frames per second (FPS).This work has been supported by the CoCoUnit ERC Advanced Grant of the EU’s Horizon 2020 program (grant No 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00, the ICREA Academia program and the AGAUR grant 2020-FISDU-00287. We would also like to thank the anonymous reviewers for their valuable comments.Peer ReviewedPostprint (author's final draft
    corecore